This week you are going to learn how to map your data. Continuing with our previous theme of international migration and moving into this week’s theme of the Great Migration, we will make two sets of maps of the United States for the censuses of 1900, 1930, 1960, and 1990. One set of maps will show the proportion of each state’s population that was born outside of the United States (foreign-born divided by total). The other set will show the proportion of each state’s population that was classified as African American (African American divided by total). By the time you finish this notebook, in addition to knowing how to visualize data in a map, you will also know how to join data frames and recode continuous variables into ordinal variables.
Start by installing and loading the libraries we will need (tidyverse, sf, tigris, viridis, cowplot) and reading in the data for this week. Remember to install any that you have not used previously. We will only be working with the 48 contiugous states, so remove records for Alaska and Hawaii and any rows where STATEFIP is missing (> 56).
library(tidyverse)
library(sf)
library(tigris)
library(viridis)
library(cowplot)
library(ggplot2)
ipums <- read_csv("wk8.csv") %>% filter(!STATEFIP %in% c(2,15) & STATEFIP <= 56)
Explore the data to see which variables and samples we are working with.
names(ipums)
[1] "YEAR" "STATEFIP" "PERWT" "RACE" "RACED" "BPL" "BPLD"
for (year in unique(ipums$YEAR)) {
print(year)
print(summary(filter(ipums, YEAR == year)$PERWT))
}
[1] 1900
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 100.0 100.0 100.1 101.0 101.0
[1] 1930
Min. 1st Qu. Median Mean 3rd Qu. Max.
65.40 99.65 100.75 100.90 101.95 278.85
[1] 1960
Min. 1st Qu. Median Mean 3rd Qu. Max.
99.00 99.00 100.00 99.61 100.00 100.00
[1] 1990
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 14.00 19.00 19.85 25.00 345.00
We will start by calculating the quantities we want to map: the percentage of each state’s population in each year that was born outside the United States and the percentage of each state’s population in each year that was classified as African American.
First we want to add to our ipums data frame two new variables: one called FBORN that indicates whether a given person was born outside the United States, and another called AfAm that indicates whether a given person was classified as African American. These variables will take values of TRUE and FALSE. Test to make sure that you have constructed these variables correctly.
ipums <- ipums %>% mutate(FBORN = BPL >= 150, AfAm = RACE == 2)
table(ipums$BPL, ipums$FBORN)
FALSE TRUE
1 317645 0
2 8596 0
4 102760 0
5 201255 0
6 935904 0
8 147681 0
9 172918 0
10 33542 0
11 70979 0
12 308663 0
13 408094 0
15 16827 0
16 59883 0
17 802171 0
18 393839 0
19 269261 0
20 197091 0
21 316616 0
22 296673 0
23 90204 0
24 222383 0
25 411302 0
26 584105 0
27 310012 0
28 239891 0
29 390109 0
30 56138 0
31 141348 0
32 25717 0
33 54361 0
34 408572 0
35 83145 0
36 1246114 0
37 423677 0
38 69911 0
39 750849 0
40 225116 0
41 118591 0
42 968866 0
44 65783 0
45 233418 0
46 69191 0
47 340971 0
48 890553 0
49 104115 0
50 40623 0
51 335957 0
53 196097 0
54 186006 0
55 345754 0
56 27196 0
90 47 0
99 47917 0
100 487 0
105 2656 0
110 57812 0
115 1796 0
120 3213 0
150 0 78011
155 0 1
160 0 1648
199 0 79
200 0 230926
210 0 55642
250 0 37434
260 0 51184
300 0 50542
400 0 6000
401 0 3850
402 0 517
403 0 2
404 0 10647
405 0 16747
410 0 46192
411 0 13697
412 0 2235
413 0 5542
414 0 38912
419 0 4
420 0 3337
421 0 11281
422 0 7
423 0 282
424 0 15
425 0 8448
426 0 5050
429 0 2
430 0 466
431 0 2
432 0 40
433 0 11966
434 0 65834
435 0 508
436 0 13276
437 0 21
438 0 6064
439 0 1
450 0 13970
451 0 585
452 0 13104
453 0 108840
454 0 11831
455 0 41879
456 0 6672
457 0 10158
458 0 101
459 0 2
460 0 629
461 0 1963
462 0 4702
463 0 4
465 0 37760
499 0 355
500 0 46137
501 0 20470
502 0 29921
510 0 24
511 0 5733
512 0 2344
513 0 8152
514 0 1451
515 0 45184
516 0 669
517 0 5491
518 0 26081
519 0 12
520 0 1312
521 0 28056
522 0 10076
524 0 102
530 0 66
531 0 480
532 0 1928
534 0 5462
535 0 1457
536 0 384
537 0 4429
538 0 26
539 0 38
540 0 800
541 0 2589
542 0 4162
543 0 91
544 0 227
545 0 28
546 0 17
547 0 32
548 0 258
549 0 16
599 0 307
600 0 16890
700 0 3799
710 0 2283
800 0 2
900 0 45221
table(ipums$RACE, ipums$AfAm)
FALSE TRUE
1 13582101 0
2 0 1696059
3 118369 0
4 78039 0
5 33416 0
6 205511 0
7 468080 0
Now create a two new data frames, one called fb and one called aa. Each data frame will have one row for each state in each year. The fb data frame will include a column called FB that indicates the percent of the state’s population that was born outside the United States. The aa data frame will include a column called AA that indicates the percent of the state’s population that was classified as African American.
fb <- ipums %>% group_by(STATEFIP,YEAR) %>% count(FBORN,wt=PERWT) %>% mutate(TOTAL = sum(n)) %>% filter(FBORN) %>% mutate(FB = n/TOTAL) %>% select(-FBORN,-n,-TOTAL) %>% ungroup
Error: Must group by variables found in `.data`.
* Column `FBORN` is not found.
Run `rlang::last_error()` to see where the error occurred.
To map data, we need to start with a data frame that contains the outline of the geography we are working with. In this case, we want a map that outlines the states of the United States. That map exists in the tigris package, and we can use the states() function to get it. We specify that we want the data frame to be an object of class sf, which stands for simple features, a standard for geographic boundary files. We will call the new data frame states.
states <- states(class = "sf")
|
| | 0%
|
|= | 0%
|
|= | 1%
|
|== | 1%
|
|== | 2%
|
|=== | 3%
|
|==== | 3%
|
|==== | 4%
|
|===== | 4%
|
|===== | 5%
|
|====== | 5%
|
|======= | 6%
|
|======== | 7%
|
|========= | 8%
|
|========== | 9%
|
|=========== | 9%
|
|=========== | 10%
|
|============ | 10%
|
|============ | 11%
|
|============= | 11%
|
|============= | 12%
|
|============== | 12%
|
|=============== | 13%
|
|================ | 14%
|
|================= | 15%
|
|================== | 16%
|
|=================== | 17%
|
|==================== | 17%
|
|==================== | 18%
|
|===================== | 19%
|
|====================== | 19%
|
|====================== | 20%
|
|======================= | 20%
|
|======================== | 21%
|
|========================= | 22%
|
|========================== | 23%
|
|=========================== | 24%
|
|============================ | 24%
|
|============================ | 25%
|
|============================= | 26%
|
|============================== | 26%
|
|=============================== | 27%
|
|================================ | 28%
|
|================================= | 29%
|
|================================== | 30%
|
|=================================== | 30%
|
|=================================== | 31%
|
|==================================== | 31%
|
|==================================== | 32%
|
|===================================== | 33%
|
|====================================== | 33%
|
|======================================= | 34%
|
|======================================= | 35%
|
|======================================== | 35%
|
|========================================= | 36%
|
|========================================== | 37%
|
|=========================================== | 38%
|
|============================================ | 38%
|
|============================================= | 39%
|
|============================================= | 40%
|
|============================================== | 40%
|
|=============================================== | 41%
|
|================================================ | 42%
|
|================================================ | 43%
|
|================================================= | 43%
|
|================================================== | 44%
|
|=================================================== | 45%
|
|==================================================== | 45%
|
|==================================================== | 46%
|
|===================================================== | 46%
|
|====================================================== | 47%
|
|====================================================== | 48%
|
|======================================================= | 48%
|
|======================================================== | 49%
|
|========================================================= | 50%
|
|========================================================== | 51%
|
|=========================================================== | 52%
|
|============================================================ | 52%
|
|============================================================ | 53%
|
|============================================================= | 53%
|
|============================================================= | 54%
|
|============================================================== | 55%
|
|=============================================================== | 55%
|
|================================================================ | 56%
|
|================================================================= | 57%
|
|================================================================== | 58%
|
|=================================================================== | 58%
|
|=================================================================== | 59%
|
|==================================================================== | 60%
|
|===================================================================== | 61%
|
|====================================================================== | 61%
|
|====================================================================== | 62%
|
|======================================================================= | 62%
|
|======================================================================== | 63%
|
|======================================================================== | 64%
|
|========================================================================= | 64%
|
|========================================================================== | 65%
|
|=========================================================================== | 66%
|
|============================================================================ | 66%
|
|============================================================================ | 67%
|
|============================================================================= | 68%
|
|============================================================================== | 68%
|
|=============================================================================== | 69%
|
|================================================================================ | 70%
|
|================================================================================ | 71%
|
|================================================================================= | 71%
|
|================================================================================== | 72%
|
|=================================================================================== | 73%
|
|==================================================================================== | 73%
|
|==================================================================================== | 74%
|
|===================================================================================== | 74%
|
|===================================================================================== | 75%
|
|====================================================================================== | 75%
|
|======================================================================================= | 76%
|
|======================================================================================== | 77%
|
|======================================================================================== | 78%
|
|========================================================================================= | 78%
|
|========================================================================================== | 79%
|
|=========================================================================================== | 80%
|
|============================================================================================ | 81%
|
|============================================================================================= | 82%
|
|============================================================================================== | 82%
|
|============================================================================================== | 83%
|
|=============================================================================================== | 83%
|
|=============================================================================================== | 84%
|
|================================================================================================ | 84%
|
|================================================================================================= | 85%
|
|================================================================================================== | 86%
|
|=================================================================================================== | 87%
|
|==================================================================================================== | 88%
|
|===================================================================================================== | 88%
|
|===================================================================================================== | 89%
|
|====================================================================================================== | 89%
|
|====================================================================================================== | 90%
|
|======================================================================================================= | 90%
|
|======================================================================================================== | 91%
|
|========================================================================================================= | 92%
|
|========================================================================================================== | 93%
|
|=========================================================================================================== | 94%
|
|============================================================================================================ | 95%
|
|============================================================================================================== | 96%
|
|============================================================================================================== | 97%
|
|=============================================================================================================== | 97%
|
|=============================================================================================================== | 98%
|
|================================================================================================================ | 98%
|
|================================================================================================================ | 99%
|
|================================================================================================================= | 99%
|
|==================================================================================================================| 100%
head(states)
Simple feature collection with 6 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -97.23909 ymin: 24.39631 xmax: -71.08857 ymax: 49.38448
Geodetic CRS: NAD83
REGION DIVISION STATEFP STATENS GEOID STUSPS NAME LSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT
1 3 5 54 01779805 54 WV West Virginia 00 G4000 A 62266231560 489271086 +38.6472854
2 3 5 12 00294478 12 FL Florida 00 G4000 A 138947364717 31362872853 +28.4574302
3 2 3 17 01779784 17 IL Illinois 00 G4000 A 143779863817 6215723896 +40.1028754
4 2 4 27 00662849 27 MN Minnesota 00 G4000 A 206230065476 18942261495 +46.3159573
5 3 5 24 01714934 24 MD Maryland 00 G4000 A 25151726296 6979340970 +38.9466584
6 1 1 44 01219835 44 RI Rhode Island 00 G4000 A 2677787140 1323663210 +41.5974187
INTPTLON geometry
1 -080.6183274 MULTIPOLYGON (((-81.74725 3...
2 -082.4091477 MULTIPOLYGON (((-86.38865 3...
3 -089.1526108 MULTIPOLYGON (((-91.18529 4...
4 -094.1996043 MULTIPOLYGON (((-96.78438 4...
5 -076.6744939 MULTIPOLYGON (((-77.45881 3...
6 -071.5272723 MULTIPOLYGON (((-71.7897 41...
We can plot this empty map with ggplot() using the geom_sf() function. The theme_map() layer from the cowplot package and the coord_sf(datum = NA) layer will remove the axes and gridlines.
ggplot(states) + geom_sf() +
theme_map() + coord_sf(datum = NA)
To place our data on this map, we need to use a join() function to link the states data frame to the data frames that contain the data we want to map.
There are four kinds of join() functions. To see how these work, run the following code block to make two separate data frames.
name <- c("Abby", "Carmen", "Diego", "Gertrude", "Hector", "Irene")
dogs <- c(2, 5, 3, 1, 1, 2)
dog_data <- data.frame(name, dogs)
name <- c("Bob", "Carmen", "Diego", "Evelyn", "Frank", "Hector")
cats <- c(2, 2, 3, 1, 1, 1)
cat_data <- data.frame(name, cats)
dog_data
cat_data
As you can see, we have six dog owners and six cat owners, and some of them are the same people. There are four different types of joins. In a left_join() and a right_join(), the first data frame listed is the left data frame and the second is the right data frame. In a left_join(), the final data frame will include all rows from the left data frame and only matching rows from the right data frame. In a right_join(), the final data frame will include all rows from the right data frame and only matching rows from the left data frame. So if we want a data frame that only includes dog owners but indicates both the number of dogs they have and the number of cats they have, we can use either a left_join() or a right_join().
dog_owners <- left_join(dog_data, cat_data)
Joining, by = "name"
dog_owners
dog_owners <- right_join(cat_data, dog_data)
Joining, by = "name"
dog_owners
By default, R will join on any column or set of columns that is identical between the two data frames. We can also specify how we want R to match up the data frames, which we will need to do when we join our data to our map.
If we want the equivalent for cat owners, we just reverse the order of the data frames.
cat_owners <- left_join(cat_data, dog_data)
Joining, by = "name"
cat_owners
cat_owners <- right_join(dog_data, cat_data)
Joining, by = "name"
cat_owners
The inner_join() function returns a data frame with only the matching rows of the two data frames that are joined. We would use this if we want a data frame including only people who own both dogs and cats.
dog_cat_owners <- inner_join(dog_data, cat_data)
Joining, by = "name"
dog_cat_owners
The full_join() function returns a data frame containing all rows of both data frames. We would use this to get a data frame containing everyone who owns either a dog or a cat or both.
pet_owners <- full_join(dog_data, cat_data)
Joining, by = "name"
pet_owners
All of these joins are dplyr functions, so we can also use pipes:
dog_owners <- dog_data %>% left_join(cat_data) # same as left_join(dog_data, cat_data)
Joining, by = "name"
cat_owners <- dog_data %>% right_join(cat_data) # same as right_join(dog_data, cat_data)
Joining, by = "name"
dog_cat_owners <- dog_data %>% inner_join(cat_data) # same as inner_join(dog_data, cat_data)
Joining, by = "name"
pet_owners <- dog_data %>% full_join(cat_data) # same as full_join(dog_data, cat_data)
Joining, by = "name"
pet_owners
dog_cat_owners
Now we can join our two data frames (fb and aa) to our map. We will use an inner join, so the result will include only rows that are common to both the data frame and the map. The column that is common to the data frames we are joining is the one with the state fips code. This is called STATEFIP in fb and aa and called STATEFP in states. The STATEFP variable in states is a character, but the STATEFIP variable in fb and aa is numeric. We will solve this incompatibility by first creating a new column in states called STATEFIP that is numeric. After we join, we need to use the st_as_sf() function from the sf package to restore the geographic properties of the resulting data frame.
states$STATEFIP <- as.numeric(states$STATEFP)
fb_map <- inner_join(fb, states) %>% st_as_sf()
Joining, by = "STATEFIP"
aa_map <- inner_join(aa, states) %>% st_as_sf()
Joining, by = "STATEFIP"
head(fb_map)
Simple feature collection with 6 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -114.8163 ymin: 30.14442 xmax: -84.88825 ymax: 37.00372
Geodetic CRS: NAD83
head(aa_map)
Simple feature collection with 6 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -114.8163 ymin: 30.14442 xmax: -84.88825 ymax: 37.00372
Geodetic CRS: NAD83
Now we can adjust the code from our empty state map to map values of FB and AA and facet by YEAR.
map_fb <- ggplot(fb_map,aes(fill=FB)) + geom_sf() +
theme_map() + coord_sf(datum = NA) + facet_wrap(vars(YEAR))
map_aa <- ggplot(aa_map,aes(fill=AA)) + geom_sf() +
theme_map() + coord_sf(datum = NA) + facet_wrap(vars(YEAR))
map_fb
map_aa
There are a few problems with these maps. First, lower values are mapping darker, which is the opposite of what we want. General principles of visual data literacy suggest that larger values should map darker. Second, the PCT variable is continuous, so the fill color is appearing on a continuous scale (changing gradually), which makes it hard to see what the actual values are. We can address these issues one at a time.
To change the colors to light –> dark, we can specify our own color ramp with the scale_fill_gradient() layer. The parameters are the color we want for the lowest value and the color we want for the highest value. Here we will use scale_fill_gradient() on the map showing percent of population classified as African American.
map_aa + scale_fill_gradient(low = "white", high = "darkslategray4")
Alternatively, the viridis package provides nice continuous color ramps for maps. The folloiwing code shows an application of the viridis package on the map showing percent of population born outside the United States.
map_fb + scale_fill_viridis(option = "magma")
You can use a number of premade color ramps through the viridis package. Here I am using “magma” but you can see others here: https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html
map_fb + scale_fill_viridis(option = "viridis")
The viridis package also allows you to specify whether smaller numbers should be lighter or darker: “direction = -1” allows me to make my smaller numbers lighter.
map_fb + scale_fill_viridis(option = "viridis", direction = -1)
Mapping on a continuous scale emphasizes similarity from state to state. To emphasize difference, we can map on a discrete scale. To do that, we will need to choose bins to sort our states into. This is similar to choosing binwidth when making a histogram. One way to choose bins is to say that we want x categories, with a roughly-equal number of states in each.
Here we will make 6 categories for the map of percent African American, but you can make any number of categories by changing the value of classes.
classes <- 6
labels <- c()
quantiles <- quantile(aa_map$AA, probs = seq(0, 1, 1/classes))
for (i in 1:length(quantiles)) {
labels <- c(labels, paste0(round(quantiles[i]*100, 1), "% - ", round(quantiles[i+1]*100, 1), "%"))
}
labels <- labels[-length(labels)]
aa_map$AAD <- cut(aa$AA, breaks = quantiles, labels = labels, include.lowest = TRUE)
ggplot(data = aa_map, aes(fill = AAD)) + geom_sf() +
theme_map(font_size = 70) + coord_sf(datum = NA) +
facet_wrap(vars(YEAR), ncol = 2) +
labs(title = "Percent of Population Classified as African American, by State",
fill = "Percent of State Population") +
scale_fill_viridis(option = "magma", discrete = T, direction = -1)
Alternatively, we can choose where we want the breaks to be and how we want to label them. Here we will do that for the foreign-born population.
breaks <- c(0, .015, .03, .06, .12, .24, 1)
labels <- c()
for (i in 1:length(breaks)) {
labels <- c(labels, paste0(breaks[i]*100, "% - ", breaks[i+1]*100, "%"))
}
labels <- labels[-length(labels)]
labels[length(labels)] <- paste0(">", breaks[length(breaks)-1] * 100, "%")
fb_map$FBD <- cut(fb$FB, breaks = breaks, labels = labels, include.lowest = TRUE)
ggplot(data = fb_map, aes(fill = FBD)) + geom_sf() +
theme_map(font_size = 70) + coord_sf(datum = NA) +
facet_wrap(vars(YEAR), ncol = 2) +
labs(title = "Percent of Population Born Outside the U.S., by State",
fill = "Percent of State Population") +
scale_fill_viridis(option = "magma", discrete = T, direction = -1)
You are now ready for Lab 8!